Hacking for Defense: An Effort to Create a Cybersecurity Risk Evaluation Metric

Data Science II: Applied Statistical Learning - Spring 2022 - Georgetown University
Ryan Ripper
05/09/2022

1. Introduction

Is there a way to systematically identify a company as a potential target of online attacks? Cybersecurity is the practice of protecting systems, networks, and programs from digital attacks. Attacks on online platforms usually attempt to access, change, or destroy sensitive information; extort money from users; or interrupt normal business processes. Cyberattacks can also be used to suppress social and political activity, to steal intellectual property, and to harm regional and international adversaries.

This paper grew out of our collaboration with Hacking for Defense (H4D) in Spring 2022. H4D is a class offered at Georgetown University and sponsored by the Department of Defense that teaches students to work with the Defense and Intelligence Communities to rapidly address the nation’s emerging threats and security challenges. Students work with sponsors to solve specific, mission-critical government problems while investigating their own interests in the field. This semester’s class was tasked with creating a standardized evaluation metric for the cybersecurity risk posed by defense companies working with the government. The metric is meant to help the National Security Agency (NSA) prioritize mitigation methods and prevent cybersecurity incidents across the Defense Industrial Base (DIB).

The NSA hopes to alleviate the burden of evaluating over 100,000 DIB companies by identifying those that present the most attractive cybersecurity targets. The H4D project plans to scale up the cybersecurity-as-a-service pilots that the NSA currently runs on a small scale for DIB companies. There is currently no standardized evaluation metric for business cybersecurity risk, and little is known about the business risk posed by companies outside the insurance industry. Thus, the class's proposal and our research question both address the complexity of the problem posed by the NSA: identifying companies that would benefit from additional cybersecurity support. This support would go to critical technology suppliers that are likely financial, political, technical, and/or military targets. The policy implications of this project involve the proper identification and allocation of government resources to protect American development of critical and emerging technologies, which the Pentagon has designated as instruments that could dictate the outcomes of future conflicts.

To answer our primary research question, we examine contracts and the corresponding awards given by the United States government. With these contracts, we can understand the significance of a company within the development of an emerging technology. We start our analysis by focusing on hypersonic technologies. This technology has the potential to further destabilize nations through the speedy deployment of nuclear missiles, is uniquely situated as a “lagging” technology for the United States relative to its foreign adversaries, and is of personal interest to the authors with respect to advanced aerospace technologies and applications. These hypersonic awards, and all contracts in general, are made available by USAspending, the official open data source of federal spending information. We focus on published subawards because they provide insight into all developers, not just the main contributors. By evaluating at the subaward scale, we expand our scope of companies involved in hypersonic technology development beyond the main developers such as Raytheon, Northrop Grumman, and Lockheed Martin.

We could supplement the USAspending data by adding other features to the main dataset. To map additional data to our data from USAspending, we have two options: match on a company's Data Universal Numbering System (D-U-N-S) number, a proprietary identifier issued by the firm Dun & Bradstreet, or match on a company's name. Matching on a company's D-U-N-S number relies on the Dun & Bradstreet Application Programming Interface (API), with which we did not have any success. Matching on a company's name relies on consistent naming conventions across multiple sources, which we cannot trust to be accurate. Thus, our efforts rest on the subaward contract information provided by USAspending. Our analysis can also be understood as an effort to stretch the information we can learn from subaward contract data. We evaluate USAspending data from calendar year 2021 so that we have the latest full year's worth of contract information.

Working with H4D, we were tasked as data scientists with wrangling data and developing machine learning models to identify companies that could be cybersecurity targets, and with developing a risk evaluation metric as a continuous monitoring tool. Data wrangling provides an initial effort to identify at-risk companies. Machine learning will help us develop a model to continue our identification of at-risk companies and, we hope, to predict attacks.

However, without additional information outside USAspending, our effort to create a cybersecurity risk evaluation metric stalls: we have no data on past cybersecurity attacks and therefore not enough identifiers to predict a potential attack. We cannot create supervised prediction models to anticipate these attacks since we do not have a specific dependent variable with which to train our models. Since no comprehensive federal law requires the public disclosure of events like data breaches, obtaining a list of past cybersecurity attacks would be incredibly challenging. Furthermore, even if we had a list of past cybersecurity attacks across the companies included in USAspending, only a small number of companies would have been victims of attacks. We would have an extremely imbalanced class distribution between the companies that had been attacked and those that had not. We could address this imbalance by oversampling the positive class (those that had been attacked), but these observations would then be repeated many times over, resulting in overfitting for some models.

This obstacle led us to pivot our focus: first to identifying companies that would be prone to cybersecurity attacks, which gives a reason to support these companies with a government response, and then to finding any anomalies in the data. Our subsequent results and analysis focus on comparisons between all subaward contracts and hypersonic subaward contracts. We develop unsupervised machine learning models to gain further insight into the USAspending data.

Ultimately, we were able to identify DIB companies that could be cybersecurity targets. These companies showed a limited focus on Information Technology infrastructure and attracted specific attention across domestic and international media outlets for their accomplishments in hypersonic technology.

2. Main Results and Analysis

We begin by first loading in the data from USAspending for all subawards given in the calendar year 2021.

We distinguish hypersonic subawards from all contracts.

Table 1. Top 10 Subaward Descriptions for Hypersonic Contracts

Table 1 shows the descriptions attached to the largest number of subawards. These descriptions are lengthy and do not give much insight into the contract itself or into the work the company is doing to contribute to hypersonic technology development. We can extend these results by finding the companies whose contracts share these descriptions and then ranking those companies by number of awards.
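The counting steps behind Tables 1 and 2 can be sketched as follows. This is a minimal illustration rather than the analysis code itself: the records, the field names `subawardee_name` and `subaward_description`, and the keyword match used to flag hypersonic subawards are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical records standing in for USAspending subaward rows. The field
# names and values are illustrative assumptions, not the real schema or data.
subawards = [
    {"subawardee_name": "Alpha Aero LLC", "subaward_description": "HYPERSONIC TEST SUPPORT"},
    {"subawardee_name": "Alpha Aero LLC", "subaward_description": "HYPERSONIC TEST SUPPORT"},
    {"subawardee_name": "Beta Sensors Inc", "subaward_description": "HYPERSONIC TEST SUPPORT"},
    {"subawardee_name": "Beta Sensors Inc", "subaward_description": "Thermal materials for hypersonic vehicles"},
    {"subawardee_name": "Gamma Foods Co", "subaward_description": "CAFETERIA SERVICES"},
]


def is_hypersonic(row):
    """Assumed rule: a subaward is hypersonic if its description mentions the word."""
    return "hypersonic" in row["subaward_description"].lower()


hypersonic = [row for row in subawards if is_hypersonic(row)]

# Table 1 analogue: descriptions ranked by how many subawards carry them.
description_counts = Counter(row["subaward_description"] for row in hypersonic)
top_descriptions = [text for text, _ in description_counts.most_common(1)]

# Table 2 analogue: companies ranked by subawards that use a top description.
company_counts = Counter(
    row["subawardee_name"]
    for row in hypersonic
    if row["subaward_description"] in top_descriptions
)
```

On real data, the argument to `most_common` would be 10 to mirror the top-10 tables, and the keyword rule would likely need refinement to avoid missing relevant subawards whose descriptions do not use the word.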

Table 2. Companies with the Most Contracts for Top Subaward Descriptions for Hypersonic Contracts

Table 2 takes our proposed step from Table 1 to identify the companies that have the most common descriptions across the hypersonic contracts. One company that stands out in this list is Optical Sciences Corporation. Optical Sciences Corporation manufactures custom sensor test equipment and provides high quality technical and engineering support services to the U.S. Army, NASA, and private aerospace/defense corporations. The company was founded in 1995 and is headquartered in Huntsville, Alabama. The company has less than \$5M in revenue and fewer than 25 employees. The company is expected to spend \$378.2K on Information Technology. The company received 9 subawards in 2021, with an average value of \$120,992.96. Optical Sciences Corporation is an example of a small firm that is actively engaged in a high number of hypersonic contracts whose descriptions are observed many times across USAspending. Given its relative size and its importance in development, this is a DIB company that could benefit from additional support.

Table 3. Top 10 Subaward Descriptions for All Contracts

Table 3 shows the top subaward descriptions across all contracts in the calendar year 2021. This table gives us insight into the importance of the hypersonic/aerospace technologies being developed. The United States government has awarded a high number of contracts to companies working in aerospace and aerospace-related fields, signaling the increased importance of developing this technology.

Table 4. Companies with the Most Contracts for Top Subaward Descriptions for All Contracts

We repeat the analysis from Table 2, now for all contracts. One company that stood out across all contracts with frequently used descriptions is Bendix Commercial Vehicle Systems LLC. As we were investigating the company, we came across its website and saw that it did not have a valid Secure Sockets Layer (SSL) certificate on the days we checked (April 11-12, 2022). An SSL certificate shows ownership over a domain. Without a valid certificate, a browser cannot verify that it is talking to the company's real server(s), which leaves transmissions open to interception. The website could also be replicated and used as a mechanism for phishing, non-payment/non-delivery, and personal data breaches. We have thus identified another company, outside the hypersonic space, that would benefit from additional support, which shows that our analysis from Table 2 can be extended from hypersonic contracts to all contracts.
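A certificate check of this kind can also be scripted. The sketch below uses only Python's standard `ssl` and `socket` modules; it is not how the certificate problem above was found, and the function names `check_certificate` and `days_until_expiry` are hypothetical helpers introduced for illustration.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Whole days from `now` to a certificate's notAfter string (negative once expired)."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return (expires - now).days


def check_certificate(host, port=443, timeout=5.0):
    """Return (is_valid, detail) for the certificate a host serves.

    The default context verifies the certificate chain, the validity dates,
    and the hostname, so an expired or mismatched certificate raises an error.
    """
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
    except ssl.SSLCertVerificationError as err:
        return False, err.verify_message
    return True, f"{days_until_expiry(cert['notAfter'])} days until expiry"
```

Calling `check_certificate` with a company's domain name requires network access and returns a pair such as `(False, reason)` when verification fails. Running it on a schedule across a list of DIB company domains would be one inexpensive input to a continuous monitoring tool.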

Figure 1. Observe Missing Values in All Subaward Contract Data

Figure 1 visualizes the missing values in the USAspending data for all contracts. We have limited the data to a subset of the variables in the original full dataset. The variables that have the most missing values are prime_awardee_parent_name, subawardee_duns, and subawardee_business_types. Each of these is a feature that will need to be converted to a categorical variable. We do further analysis to determine how to handle the missing values.

Table 5. Count of Missing Values for Each Variable for All Contracts

The variable that has the most missing values is subawardee_business_types. We find that 9.397% of all observations have missing values. Since this is a small portion relative to the total subaward data, we can drop all observations that have missing values from our dataset to continue with our analysis.
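The missing-value count and the drop step can be sketched as below. The rows are invented for illustration; only the column names come from the text above. The sketch computes the share of rows with at least one missing value, which is one reading of the percentage reported here.

```python
# Hypothetical rows; None marks a missing value. Column names follow the
# variables named in the text, but every value is invented for illustration.
rows = [
    {"prime_awardee_parent_name": "Parent A", "subawardee_duns": "001", "subawardee_business_types": "Small Business"},
    {"prime_awardee_parent_name": None, "subawardee_duns": "002", "subawardee_business_types": None},
    {"prime_awardee_parent_name": "Parent B", "subawardee_duns": None, "subawardee_business_types": None},
    {"prime_awardee_parent_name": "Parent C", "subawardee_duns": "004", "subawardee_business_types": "Woman Owned"},
]

columns = list(rows[0])

# Table 5 analogue: number of missing values in each column.
missing_per_column = {col: sum(row[col] is None for row in rows) for col in columns}

# Keep only rows with no missing value in any column.
complete_rows = [row for row in rows if all(row[col] is not None for col in columns)]

# Percentage of observations removed by the drop.
share_dropped = 100 * (len(rows) - len(complete_rows)) / len(rows)
```

Dropping rows is reasonable when the share is small, as it is here, but it assumes the missing values are not concentrated in a particular kind of company. Checking whether the dropped rows differ systematically from the kept ones would test that assumption.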

Figure 2. Number of Subawards Against Average Subaward Amount for Hypersonic Contracts

When we plot the number of subawards against the average subaward amount for hypersonic contracts in calendar year 2021, we observe that a majority of the contracts are clustered together. However, there are two observations that are outliers, one in each dimension. The outlier for average subaward amount is Honeywell International Inc. The outlier for number of subawards is Carahsoft Technology Corp. Of these two companies, Carahsoft Technology Corp. stands out because it provides "government IT solutions", including solutions in cybersecurity. With so many subawards given to Carahsoft Technology Corp., the United States government shows its increased interest in digital solutions in the hypersonic space.
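The two axes of Figure 2 are per-company aggregates: how many subawards a company received and their mean amount. A minimal sketch of that aggregation, using invented company names and amounts, could look like this:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (company, subaward amount in dollars) pairs, for illustration only.
subawards = [
    ("Alpha Aero LLC", 100_000.0),
    ("Alpha Aero LLC", 140_000.0),
    ("Beta Sensors Inc", 2_000_000.0),
    ("Gamma Systems Corp", 50_000.0),
    ("Gamma Systems Corp", 60_000.0),
    ("Gamma Systems Corp", 70_000.0),
]

amounts_by_company = defaultdict(list)
for company, amount in subawards:
    amounts_by_company[company].append(amount)

# One point per company: (number of subawards, average subaward amount).
summary = {
    company: (len(amounts), mean(amounts))
    for company, amounts in amounts_by_company.items()
}

# The extremes along each axis are the candidates for visual outliers.
most_subawards = max(summary, key=lambda c: summary[c][0])
largest_average = max(summary, key=lambda c: summary[c][1])
```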

Figure 3. Number of Subawards Against Average Subaward Amount with K-Means Clustering Results for Hypersonic Contracts

We use K-Means Clustering to identify any group relationships between the subawards within the USAspending data for hypersonic contracts. K-Means Clustering partitions the observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. Figure 3 overlays the results of this unsupervised clustering algorithm on the plot of number of subawards against average subaward amount. We use 5 clusters. The clusters separate along the average subaward amount and not along the number of subawards. We cannot gain much insight from these results since the grouping appears along only one variable. If we were to convert the subaward amount into an ordinal variable, we would most likely reproduce the same grouping simply by defining the bins ourselves.
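The nearest-mean rule described above can be written out in a few lines. The sketch below is a plain-Python version of the standard Lloyd's algorithm on invented data with 2 clusters instead of 5; it is not the implementation used for Figure 3. It also shows why a split along the amount axis is what one would expect if the features were left unscaled, although the text does not say whether they were: dollar amounts are several orders of magnitude larger than subaward counts, so they dominate the Euclidean distance.

```python
import random


def kmeans(points, k, iterations=50, seed=0):
    """Lloyd's algorithm: assign each point to its nearest mean, then update the means."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct observations
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: index of the closest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = tuple(sum(values) / len(members) for values in zip(*members))
    return labels


# Hypothetical (number of subawards, average subaward amount in dollars) pairs.
companies = [(1, 50_000.0), (40, 55_000.0), (2, 900_000.0), (38, 950_000.0)]

# On the raw scale the clusters follow the amount, not the count.
raw_labels = kmeans(companies, k=2)
```

Here the first two companies share a cluster and the last two share the other, even though their subaward counts differ widely within each pair. Standardizing each feature before clustering is a common way to let both variables contribute.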

Figure 4. Results of T-SNE Clustering Model with Dimension Reduced Data for Hypersonic Contracts

We visualize the high-dimensional data using t-distributed stochastic neighbor embedding (t-SNE) and then map the results of the K-Means Clustering onto the reduced plot, as seen in Figure 4. Since we reduced the dimensionality of the data to two dimensions, it is difficult to identify true outliers among the clusters. A clear clustering appears once we reduce the dimensionality of the data, but we are unable to interpret the results because the reduced axes have no well-defined meaning. The most we can say is that the grouping to the far right may be a group of outliers within the "purple" cluster. We can do further analysis to determine if the observations in this group are indeed anomalies.

Figure 5. Anomaly Detection Model Results for Hypersonic Contracts - 3D T-SNE Plot for Outliers

Figure 5 presents the results of the anomaly detection model using hypersonic subaward contract data. We use anomaly detection to identify rare items, events, or observations that deviate significantly from the majority of the data and do not conform to a well-defined notion of normal behavior. We create an Isolation Forest, an algorithm that isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of that feature. Anomalous observations tend to be isolated in fewer splits than typical ones. We observe several anomalies within our set of hypersonic subawards, which we explore in further detail. Since we again reduced the dimensionality of the data, now to three dimensions, it is still difficult to understand what makes these anomalies true outliers within the dataset.
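The random-split idea behind an Isolation Forest can be illustrated with a small plain-Python sketch. It follows the standard formulation of the algorithm, in which a point's score is 2 raised to the negative of its average path length divided by a normalizing constant, but it is a simplified version: it omits subsampling, uses invented data, and is not the model that produced Figure 5.

```python
import math
import random


def c_factor(n):
    """Expected path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def build_tree(sample, rng, depth, max_depth):
    """Recursively split on a random feature at a random value between its min and max."""
    if len(sample) <= 1 or depth >= max_depth:
        return {"size": len(sample)}
    feature = rng.randrange(len(sample[0]))
    lo = min(p[feature] for p in sample)
    hi = max(p[feature] for p in sample)
    if lo == hi:  # nothing to split on for this feature
        return {"size": len(sample)}
    split = rng.uniform(lo, hi)
    return {
        "feature": feature,
        "split": split,
        "left": build_tree([p for p in sample if p[feature] < split], rng, depth + 1, max_depth),
        "right": build_tree([p for p in sample if p[feature] >= split], rng, depth + 1, max_depth),
    }


def path_length(x, node, depth=0):
    """Number of splits needed to reach x's leaf, adjusted for unsplit leaf size."""
    if "size" in node:
        return depth + c_factor(node["size"])
    child = node["left"] if x[node["feature"]] < node["split"] else node["right"]
    return path_length(x, child, depth + 1)


def anomaly_scores(points, n_trees=200, seed=0):
    """Score in (0, 1]; values near 1 mean a point is isolated after very few splits."""
    rng = random.Random(seed)
    max_depth = math.ceil(math.log2(len(points)))
    trees = [build_tree(points, rng, 0, max_depth) for _ in range(n_trees)]
    norm = c_factor(len(points))
    return [
        2 ** (-sum(path_length(x, tree) for tree in trees) / (n_trees * norm))
        for x in points
    ]


# Hypothetical (number of subawards, average amount in $ thousands); the last
# point is far from the rest in both dimensions.
points = [(1, 50.0), (2, 60.0), (2, 55.0), (3, 65.0), (1, 70.0), (3, 52.0), (2, 58.0), (40, 900.0)]
scores = anomaly_scores(points)
```

The last point is cut off by almost any first split, so its average path is close to one and its score is the highest in the list. The scores say only that a point is easy to isolate, not why, which is consistent with the difficulty of interpreting the anomalies noted above.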

Table 6. Anomalies by Company Name for Hypersonic Contracts

We identify the companies associated with the anomalies detected by the anomaly detection model. Lockheed Martin Corporation stands out among these companies as large relative to the other subawardees. We expect to see Lockheed Martin as an important contributor in the hypersonic space, and it would always be a target for cybersecurity attacks simply because of its position as one of the largest aerospace developers in the United States. As for the other companies, quick Google searches showed each company's significance in hypersonic development. For example, several recent articles discuss CUBRC, Inc. receiving large contracts from the Department of Defense. HySonic Technologies, LLC has been the subject of several articles published in the past month (April 2022) detailing its contribution to reusable hypersonic flight testing, a critical part of the development of the entire technology.

Table 7. Anomalies by Subaward Description for Hypersonic Contracts

We identify the subaward descriptions associated with the anomalies detected by the anomaly detection model. These results do not give particular insight into the anomalies or why they were detected. We can further this analysis by mapping these descriptions to all hypersonic contracts to see if we observe similar company-specific results.

Table 8. Companies with the Most Contracts for Top Subaward Descriptions for Hypersonic Anomalies

Once we match the anomaly descriptions to the descriptions across the set of hypersonic subawards, we observe company results similar to those we obtained when we identified the companies directly from the anomaly detection algorithm. One interpretation is that the descriptions themselves caused the anomalies to appear in the first place. However, we will need to do additional anomaly detection to see if there is a true link between subaward description and the identified anomalies. As a counterpoint, the descriptions associated with the anomalies are for the most part general. A general description could be used for many contracts across all subawards, let alone hypersonic ones, which makes it less likely that the descriptions alone led to these observations being identified as anomalies.

3. Conclusion and Final Remarks

Even though our initial goal of creating a cybersecurity risk evaluation metric was not successful, we identified companies that are critical contributors to hypersonic technology development and could therefore be cybersecurity targets for American adversaries. Our analysis focuses on data wrangling to find specific companies that could benefit from additional government support within the digital sphere. Of the companies we identified, the ones that stand out the most with respect to the data analysis, machine learning, and associated media coverage are Optical Sciences Corporation, Bendix Commercial Vehicle Systems LLC, and HySonic Technologies, LLC. Among the companies assumed to have a critical role in hypersonic development, and which should already have significant infrastructure in place to handle cybersecurity attacks, Lockheed Martin Corporation held slightly under 20% of the entire aerospace market as of the end of calendar year 2021. Our analysis succeeded in identifying both companies that could benefit from additional support and companies that should already have cybersecurity measures in place because they represent large portions of their respective markets.

One specific area of concern is the small number of observations we used in our unsupervised machine learning algorithms. The algorithms could perform better with a larger number of subawards, but we must be wary of model bias and variance. Another consideration is whether or not the results are truly tied to cybersecurity issues. The results indeed found companies that could be targets. However, these companies could also be examples of some other unknown factor(s) within the data.

Even though we focused our analysis on hypersonic subaward contracts, our findings show that our approach can be extended to any other emerging technology in which the United States government has shown interest. Such technologies include artificial intelligence, quantum information technologies, biotechnologies, and space technologies and systems.

To continue our initial effort to create a cybersecurity risk evaluation metric, I aim to meet with the NSA officials who originally sponsored the Hacking for Defense team. I hope to gain additional insight into new ways to supplement the USAspending data and to explore other avenues of analysis. Their insights could contribute to developing a dashboard in which an emerging or critical technology can be selected and the companies, descriptions, business types, etc. with increased risk of cybersecurity attack are highlighted. Such a dashboard could replace repeated efforts to analyze large amounts of data with a readily available system at the disposal of the United States government and other cybersecurity participants.

I would like to thank Professor Gregory Lyon for his introduction to the Hacking for Defense program and his support throughout this project. I would also like to thank Thomas Keelan who was my partner within the Hacking for Defense program. Thomas helped narrow my focus and motivated my approach to the question at hand.